Recognition of Genetic Conditions After Learning With Images Created Using Generative Artificial Intelligence

Key Points
Question: When compared with other education methods, is exposure to images developed using generative artificial intelligence associated with improved recognition of Kabuki and Noonan syndromes among pediatric residents?
Findings: In this comparative effectiveness study, generative methods were used to create images of fake but realistic-appearing individuals with Kabuki and Noonan syndromes. Through online surveys, generated images were found to help residents recognize these syndromes and improved their confidence in this area compared with text-only descriptions, although real images were most helpful.
Meaning: These findings suggest that generative artificial intelligence could supplement genetics education for pediatric residents by helping teach the recognition of rare conditions.

This supplementary material has been provided by the authors to give readers additional information about their work.

eFigure 1. Generation of unaffected and syndromic faces. First, real images were aligned so that all the faces have roughly the same size and orientation. StyleGAN2-ADA was fine-tuned with disease, age, and gender as labels; this gives us more control when generating images within a specific category. The right side shows the transformation of fake unaffected images into images resembling Kabuki syndrome. ADA: adaptive discriminator augmentation.

Detailed methods for generative AI images
Our image generator is based on Nvidia StyleGAN2-ADA. StyleGAN2-ADA creates a fake random image by converting a random vector into this image [1]; a label vector (default size 512 x 1), if available, can also be used with this random vector, giving the user more control over generating images that meet certain criteria (e.g., face images of only young individuals). The default Nvidia StyleGAN2-ADA trained on the FFHQ dataset did not use labels. We followed the same approach as in our previous work [2,3] and fine-tuned the default StyleGAN2-ADA on our own labeled datasets. Our label vector of size 512 consists of three subparts: a vector of size 256 indicating the disease label, a vector of size 128 indicating the age group, and a vector of size 128 indicating the gender. Hence, the additional parameters to be trained are the disease embedding 256 x 11 (we used 10 genetic conditions and one set of unaffected images for training; see: https://github.com/datduong/stylegan3-syndromic-faces), the age embedding 128 x 5 for five age groups (under 2 years old, 2-9 years old, 10-19 years old, 20-34 years old, and >35 years old), and the gender embedding 128 x 2 for female and male. Due to low sample sizes, we could only train a small number of embeddings. Hence, the age and gender embeddings are shared across all images; that is, these embeddings are not specific to a particular genetic condition.
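The composition of the 512-dimensional label vector described above can be illustrated with a minimal sketch. The embedding tables below are randomly initialized placeholders (not the trained embeddings), and the function name `make_label_vector` and the index values are hypothetical, introduced only for illustration.

```python
import numpy as np

# Sizes taken from the text: 11 disease classes (10 conditions + unaffected),
# 5 age groups, 2 genders; sub-embedding widths 256, 128, and 128.
N_DISEASE, N_AGE, N_GENDER = 11, 5, 2
D_DISEASE, D_AGE, D_GENDER = 256, 128, 128

rng = np.random.default_rng(0)

# Placeholder tables standing in for the trained embeddings (assumption).
disease_emb = rng.normal(size=(N_DISEASE, D_DISEASE))
age_emb = rng.normal(size=(N_AGE, D_AGE))
gender_emb = rng.normal(size=(N_GENDER, D_GENDER))


def make_label_vector(disease_idx, age_idx, gender_idx):
    """Concatenate the three sub-embeddings into one label vector."""
    return np.concatenate(
        [disease_emb[disease_idx], age_emb[age_idx], gender_emb[gender_idx]]
    )


# Hypothetical indices: disease class 3, age group 1, gender 0.
label = make_label_vector(disease_idx=3, age_idx=1, gender_idx=0)

# Number of additional trainable parameters implied by the three tables.
n_extra_params = disease_emb.size + age_emb.size + gender_emb.size
```

Because the age and gender tables are indexed independently of the disease index, the same age and gender embeddings are reused for every condition, which mirrors the sharing described in the text.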
Although we evaluated the GAN application on just KS and NS, we used images of the other conditions during fine-tuning (eTable 1). Since people with some of these conditions can have similar facial features, using images of people with additional conditions allows us to generate more specific images of people with the genetic conditions of interest.
Because our dataset is small, we fine-tuned Nvidia StyleGAN2-ADA at an image size of 256 x 256 instead of the higher resolution 1024 x 1024. Following the Nvidia image pre-processing, we aligned all the faces so that all images have roughly similar head sizes, orientations, and eye positions. We further removed the background by setting these pixel values to zero; this approach helps remove some potential GAN artifacts such as artifactual strands of hair or backgrounds. Since we have different numbers of images of each condition, during fine-tuning we applied sample weighting so that images from each disease have roughly uniform weights (see: https://github.com/datduong/stylegan3-syndromic-faces).
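One simple way to give each condition roughly equal total weight is inverse-frequency weighting. The sketch below assumes this scheme for illustration; it is not taken from the linked repository, and the function name and the image counts are hypothetical.

```python
from collections import Counter


def uniform_class_weights(labels):
    """Return one weight per image so that every class has equal total weight.

    Each class receives total mass 1 / n_classes, split evenly among its images.
    """
    counts = Counter(labels)
    n_classes = len(counts)
    return [1.0 / (n_classes * counts[lab]) for lab in labels]


# Hypothetical, deliberately imbalanced image counts.
labels = ["KS"] * 3 + ["NS"] * 1 + ["unaffected"] * 6
weights = uniform_class_weights(labels)

# Total weight per class after weighting.
class_totals = {}
for lab, w in zip(labels, weights):
    class_totals[lab] = class_totals.get(lab, 0.0) + w
```

With these weights, a condition represented by one image contributes as much in aggregate as a condition represented by six.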
We fine-tuned StyleGAN2-ADA on our images together with their corresponding labels (eFigure 1). These labels provide more control to generate fake images of a person with a particular genetic condition, age, and gender, and, for the "transformation strips" (see below), allow us to alter these fake images to resemble unaffected individuals by manipulating the disease embeddings. We generated single images as well as what we called "transformation strips", which depict an individual changing from unaffected to affected. This was inspired by previous work we had done using GANs to depict disease progression, such as cutaneous findings in neurofibromatosis type 1 [2]. In preliminary testing, medical genetics residents reported that transformation strips helped them recognize genetic conditions.
After fine-tuning, we selected a random vector and the label embeddings to generate a random image of a person representing a particular genetic condition, age group, and gender. More specifically, to generate an image of a young female individual with KS (and likewise NS), we would provide StyleGAN2-ADA with a random vector and the label embeddings of KS, young child, and female. To make this generated individual with KS look like a person without this condition (for the transformation strips described above), we interpolated between the embeddings of KS and unaffected, while keeping the corresponding random vector and the age and gender embeddings unchanged.
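The interpolation step for a transformation strip can be sketched as follows. Linear interpolation with evenly spaced steps is an assumption made here for illustration (the text states only that the embeddings were interpolated), all embeddings are random placeholders, and the function name is hypothetical. The generator call itself is omitted.

```python
import numpy as np

rng = np.random.default_rng(0)

# Placeholder embeddings; in practice these would come from the fine-tuned model.
emb_unaffected = rng.normal(size=256)
emb_ks = rng.normal(size=256)
emb_age = rng.normal(size=128)
emb_gender = rng.normal(size=128)

# The random vector is drawn once and reused for every frame of the strip.
z = rng.normal(size=512)


def transformation_labels(emb_start, emb_end, emb_age, emb_gender, n_steps=5):
    """Build label vectors moving from emb_start to emb_end.

    Only the disease sub-embedding changes; age and gender stay fixed.
    """
    rows = []
    for alpha in np.linspace(0.0, 1.0, n_steps):
        disease = (1.0 - alpha) * emb_start + alpha * emb_end
        rows.append(np.concatenate([disease, emb_age, emb_gender]))
    return np.stack(rows)


labels = transformation_labels(emb_unaffected, emb_ks, emb_age, emb_gender)
# Each row of `labels`, paired with the same `z`, would be passed to the
# fine-tuned generator to render one frame of the strip.
```

Keeping `z` and the age and gender parts constant means that, in principle, only the disease-related appearance changes from one frame to the next.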

Detailed statistical analyses
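Logistic regression for clustered data was applied since each image was seen by multiple different participants. Following the notation in Miglioretti et al. [4], we have the following model:

$$\operatorname{logit}(p_{ij}) = \beta_0 + \beta_{\text{real}}\, x^{\text{real}}_{ij} + \beta_{\text{GAN}}\, x^{\text{GAN}}_{ij} + \beta_{\text{trans}}\, x^{\text{trans}}_{ij} \qquad (1)$$

$x^{\text{real}}_{ij}$, $x^{\text{GAN}}_{ij}$, and $x^{\text{trans}}_{ij}$ are binary dummy variables indicating to which intervention the participant $i$ was exposed prior to analyzing the image $j$.
$\operatorname{logit}(p_{ij})$ denotes the logit transformation of the probability $p_{ij}$ of the binary output indicating if participant $i$ correctly identifies image $j$. $p_{ij}$ is the metric of interest (discussed below).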
We analyze the KS and NS surveys separately. For each of these two conditions, we further partition the survey data into two subsets. For example, when analyzing KS surveys, we partition the questions into a subset containing only KS images and another subset containing only the other conditions. This allows us to measure how the interventions affect the sensitivity $\mathrm{sens}(x_{ij}) = P(Y_{ij} = 1 \mid D_j = 1, x_{ij})$ and specificity $\mathrm{spec}(x_{ij}) = P(Y_{ij} = 0 \mid D_j = 0, x_{ij})$, where $x_{ij}$ denotes the intervention indicators for participant $i$, $Y_{ij} = 1$ if participant $i$ classifies image $j$ as KS, and $D_j$ indicates whether or not image $j$ indeed shows a patient with KS.
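As a small worked illustration of these two quantities, the sketch below computes raw sensitivity and specificity from one participant's answers. The response counts are hypothetical and only mirror the survey layout (12 images of the syndrome of interest and 8 of other conditions); this is the simple unadjusted proportion, not the regression-based estimate.

```python
def sensitivity_specificity(responses):
    """Compute (sensitivity, specificity) from (said_syndrome, is_syndrome) pairs."""
    tp = sum(1 for said, truth in responses if truth and said)
    fn = sum(1 for said, truth in responses if truth and not said)
    tn = sum(1 for said, truth in responses if not truth and not said)
    fp = sum(1 for said, truth in responses if not truth and said)
    return tp / (tp + fn), tn / (tn + fp)


# Hypothetical answers: 9 of 12 syndrome images and 6 of 8 other images correct.
responses = (
    [(True, True)] * 9 + [(False, True)] * 3
    + [(False, False)] * 6 + [(True, False)] * 2
)
sens, spec = sensitivity_specificity(responses)
```

Partitioning the questions into the two subsets, as described above, corresponds to computing `sens` only from the pairs with `is_syndrome` true and `spec` only from the pairs with `is_syndrome` false.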
The intercept $\beta_0$ corresponds to the marginal (or population-averaged) effect of the text-only intervention. The regression coefficients $\beta_{\text{real}}$, $\beta_{\text{GAN}}$, and $\beta_{\text{trans}}$ (for the real-image, GAN, and transformation strip interventions) estimate how the log odds change with respect to the text-only intervention. These parameters were estimated using generalized estimating equations (GEE) via the R library geepack [5]. Tables s1 and s2 show the sensitivity outcome of KS with and without the text-only intervention as the intercept term. Table s4 shows the sensitivity outcome of NS. Tables s3 and s5 show the noninferiority for each survey sensitivity analysis.
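To make the interpretation of these coefficients concrete, the following sketch converts log odds into odds ratios and probabilities. The coefficient values are hypothetical and are not results from this study; they only show why an intercept near zero corresponds to random guessing on a binary question.

```python
import math


def odds_ratio(coef):
    """Odds ratio implied by a logistic regression coefficient."""
    return math.exp(coef)


def probability(log_odds):
    """Inverse logit: probability implied by a log odds value."""
    return 1.0 / (1.0 + math.exp(-log_odds))


# Hypothetical values for illustration only.
beta_0 = 0.0      # text-only intercept; a log odds of 0 means a 50% chance
beta_real = 0.4   # change in log odds for the real-image intervention

p_text_only = probability(beta_0)
p_real = probability(beta_0 + beta_real)
or_real_vs_text = odds_ratio(beta_real)
```

An intercept that is not significantly different from zero therefore cannot be distinguished from a 50% success rate, and a positive coefficient corresponds to an odds ratio above 1 relative to the reference intervention.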
We applied the same regression technique as above to analyze how confidence level correlates with accuracy within each intervention. Separately in each intervention, we test whether there is a positive association between confidence level and accuracy (Tables s7 and s8).

Table s7.
Logistic regression measuring the confidence level versus accuracy score within each educational intervention for KS survey. GEE method was applied to handle clustered data. No to some confidence (no/low) is set as the intercept, and confident to highly confident (med/high) is set as the slope. In all 4 interventions, the intercept term is statistically insignificant, indicating that when confidence is low, the participant randomly guesses the syndrome (e.g., guessing between "KS" and "other condition"). In the text-only intervention, the regression coefficient of the med/high level is statistically insignificant; hence, even when the participant is confident, their log odds are still not different from random guessing. This is not the case for the real-image, GAN, and transformation interventions. The 3 image interventions have significant med/high confidence-level coefficients; hence, when the average participant was confident, they selected the correct syndrome with log odds statistically better than random guessing.

eFigure 1. Generation of Unaffected and Syndromic Faces
eMethods. Supplemental Methods
eFigure 2. Schema of Kabuki Syndrome Surveys
eTable 3. Logistic Regression for Clustered Data Fit With and Without the Text-Only Intervention as the Intercept Term
eFigure 3. Pre- to Post-Intervention Change in Important Diagnostic Facial Features of Kabuki Syndrome
eFigure 4. Pre- to Post-Intervention Change in Important Diagnostic Facial Features of Noonan Syndrome
eTable 4. Non-inferiority Analysis for KS and NS Image Interventions
eTable 5. Survey Completion Rates
eTable 6. Average Survey Accuracies for the 8 Other Condition Images Stratified by Survey Type
eReferences

eFigure 3.
Pre- to post-intervention change in important diagnostic facial features of Kabuki syndrome. All interventions, including the text-only, involved a decrease in the "unsure" response and an increase in responses indicating the typical features associated with Kabuki syndrome (affecting the eyes, ears, and nose). GAN: generative adversarial network.

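eFigure 4.
Pre- to post-intervention change in important diagnostic facial features of Noonan syndrome. All interventions, including the text-only, involved a decrease in the "unsure" response and an increase in responses indicating the typical features associated with Noonan syndrome (affecting the eyes and ears). GAN: generative adversarial network.

Table s6 shows the specificity outcomes of both KS and NS. For the noninferiority test, we remove the text-only surveys and analyze only the responses for the other three interventions. Similar to Equation 1, we fit the model:

$$\operatorname{logit}(p_{ij}) = \gamma_0 + \gamma_{\text{GAN}}\, x^{\text{GAN}}_{ij} + \gamma_{\text{trans}}\, x^{\text{trans}}_{ij} \qquad (2)$$

Here $p_{ij}$ is the probability that participant $i$ correctly identifies image $j$, and $x^{\text{GAN}}_{ij}$ and $x^{\text{trans}}_{ij}$ are binary dummy variables for the GAN and transformation strip interventions. $\gamma_0$ is the intercept, measuring the marginal log odds of success in the real image intervention. $\gamma_{\text{GAN}}$ and $\gamma_{\text{trans}}$ estimate how the other two interventions change the log odds with respect to the real image intervention. We again apply the R library geepack to approximate the regression coefficients.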

Table s1.
Sensitivity analysis in KS survey. Equation 1 was fitted with all the regression coefficients, and the intercept is found to be statistically insignificant.

Table s2.
Sensitivity analysis in KS survey. Equation 1 was fitted to the survey subset containing only questions showing KS patients in the KS surveys. The intercept was removed because it was not found to be statistically significant. The real, GAN, and transformation strip interventions statistically improve the odds of recalling KS images, as compared to random guessing.

Table s3.
Noninferiority test on sensitivity for KS. Equation 2 was fitted to the survey subset containing only questions showing KS patients, using only responses in the real image, GAN image, and transformation strip interventions. The GAN and transformation strip coefficients have nonsignificant p-values, indicating that, with respect to the real image intervention, the GAN and transformation interventions do not statistically change the odds of recalling KS images.

Table s4.
Sensitivity analysis in NS survey. Equation 1 was fitted to the survey subset containing only questions showing NS patients in the NS surveys. The text-only intervention (the intercept) has a statistically positive effect, indicating that the participants on average may already be more familiar with NS. Only the real image intervention statistically improves the odds of recalling NS images with respect to the text-only intervention.

Table s5.
Noninferiority test on sensitivity for NS. Equation 2 was fitted to the survey subset containing only questions showing NS patients, using only responses in the real image, GAN image, and transformation strip interventions. With respect to the real image intervention, the GAN and transformation interventions do not statistically change the odds of recalling NS images.

Table s6.
Specificity analysis in the KS and NS surveys. For each disease type, Equation 1 was fitted to the survey subset containing only questions showing the other syndromes. None of the interventions is statistically significant, indicating that, with respect to the text-only intervention, none of the interventions statistically affects how the average participant recognizes the other syndromes.

Table s8.
Logistic regression measuring the confidence level versus accuracy score within each educational intervention for NS survey. GEE method was applied to handle clustered data. In each intervention, confidence level is positively correlated with accuracy.

eFigure 2. Schema of Kabuki syndrome surveys.
Each condition evaluated had 4 arms (4 different educational interventions): text describing facial characteristics (no images), text plus 5 real images, text plus 5 GAN single images, and text plus 5 transformation strips. After the educational intervention, participants were asked to classify 20 images (12 of the syndrome of interest and 8 of other syndromes). The same schema was used for Noonan syndrome. GAN: generative adversarial network.

eTable 3. Logistic regression for clustered data fit with (left column) and without (right column) the text-only intervention as the intercept term.
When fitting with the intercept term, the OR compares the text-only effect against random guessing. The ORs of the other interventions were then compared against text-only. Without the intercept term, the reference OR is set as 1 to represent the random guessing OR for binary output. The ORs of the other interventions were then compared against random guessing.
eTable 4. Non-inferiority analysis for KS and NS image interventions.
ORs were calculated for the real images compared to random chance (e.g., viewing real images is associated with 1.52 times higher odds of accuracy than random guessing). Generative images were compared to real images for each condition. Although generative images have a lower OR than real images, these differences were not statistically significant. OR: odds ratio; CI: confidence interval; GAN: generative adversarial network.

eTable 5. Survey completion rates.
A survey was considered complete if a response was entered for all 20 test images. The classification portion of the survey was considered started if a response was entered for at least 1 test image. GAN: generative adversarial network.

eTable 6. Comparing association strength between image interventions and specificity with respect to the text-only intervention.
Average accuracy (true negative rate) was computed by averaging the accuracy of each participant every time an image of a syndrome other than the syndrome of interest was shown for both the KS and NS surveys. Odds ratios (OR) of the text-only intervention are significantly higher than random guessing, and the other three interventions are not significantly different from the text-only intervention. CI: confidence interval; GAN: generative adversarial network.